I. Introduction


Convexity plays a central role in a wide variety of machine learning and statistical inference problems. A standard paradigm is to distinguish a preferred member from a set of candidates based upon a convex impurity measure or loss function tailored to the specific problem to be solved. Examples include least squares regression, decision trees, boosting, online learning, maximum likelihood for exponential models, logistic regression, maximum entropy, and support vector machines. Such problems can often be naturally cast as convex optimization problems involving a Bregman distance, which can lead to new algorithms, analytical tools, and insights derived from the powerful methods of convex analysis.

In  this  paper  we  formulate  and  prove  a  convex  duality  theorem  for  minimizing  a  general  class 
of  Bregman  distances  subject  to  linear  constraints.  The  duality  result  is  then  used  to  derive 
iterative  algorithms  for  solving  the  associated  optimization  problem.  Our  presentation  is  motivated 
by  the  recent  work  of  Collins,  Schapire,  and  Singer  (2001),  who  showed  how  certain  boosting 
algorithms  and  maximum  likelihood  logistic  regression  can  be  unified  within  the  framework  of 
Bregman distances. In particular, specific instances of the results given here are used by Collins
et al. (2001) to show the convergence of a family of iterative algorithms for minimizing the exponential
or logistic loss.

While invoking methods from convex analysis can unify and clarify the relationship between different methods, the higher level of abstraction often comes at a price, since there can be considerable technicalities. For example, in some treatments the assumptions on the convex functions that can be used to define Bregman distances are very technical and difficult to verify. Here we trade off generality for relative simplicity by working with a restricted class of Bregman distances, which nevertheless includes many of the examples that arise in machine learning. Our treatment of duality and auxiliary functions for Bregman distances closely parallels the results presented by Della Pietra et al. (1997) for the Kullback-Leibler divergence. In particular, the statement and proof of the duality theorem given in (Della Pietra et al., 1997) carry over with only a few changes to the class of Bregman distances we consider.

Our  approach  differs  from  much  of  the  literature  in  convex  analysis  in  several  ways.  First,  we  work 
primarily  with  the  argument  at  which  a  convex  conjugate  takes  on  its  value,  rather  than  the  value 
of  the  function  itself.  The  reason  for  this  is  that  the  argument  corresponds  to  a  statistical  model, 
which  is  the  main  object  of  interest  in  statistical  or  machine  learning  applications,  while  the  value 
corresponds  to  a  likelihood  or  loss  function.  Second,  while  Bregman  distances  are  typically  defined 
only  on  the  interior  of  the  domain  of  the  underlying  convex  function,  we  assume  that  there  is  a 
continuous  extension  to  the  entire  domain.  This  makes  it  possible  to  formulate  a  very  natural 
duality  theorem  that  also  includes  many  cases  required  in  practice,  when  the  desired  model  may 
lie  on  the  boundary  of  the  domain. 

The  following  section  recalls  the  standard  definitions  from  convex  analysis  that  will  be  required,  and 
presents  the  technical  assumptions  made  on  the  class  of  Bregman  distances  that  we  work  with.  We 
also  introduce  some  new  terminology,  using  the  terms  Legendre-Bregman  conjugate  and  Legendre- 
Bregman  projection  to  extend  the  classical  notion  of  the  Legendre  conjugate  and  transform  to 
Bregman  distances.  Section  3  contains  the  statement  and  proof  of  the  duality  theorem  that  connects 
the  primal  problem  with  its  dual,  showing  that  the  solution  is  characterized  in  geometrical  terms 
by a Pythagorean equality. Section 4 defines the notion of an auxiliary function, which is used to
construct  iterative  algorithms  for  solving  constrained  optimization  problems.  This  section  shows 
how  convexity  can  be  used  to  derive  an  auxiliary  function  for  Bregman  distances  based  on  separable 
functions.  The  last  section  summarizes  the  main  results  of  the  paper. 


II.  Bregman  Distances  and  Legendre-Bregman  Projections 

In  this  section  we  begin  by  establishing  our  notation  and  recalling  the  relevant  notions  from  convex 
analysis  that  we  require;  the  classic  text  (Rockafellar,  1970)  remains  one  of  the  best  references 
for  this  material.  We  then  define  Bregman  distances  and  their  associated  conjugate  functions  and 
projections,  and  derive  various  relations  between  these  that  will  be  important  in  proving  the  duality 
theorem.  Next  we  state  our  assumptions  on  the  underlying  convex  function  that  enable  us  to  derive 
these properties for the continuous extension of the Bregman distance to the entire domain.

A.  Notation  and  Basic  Definitions 

We will use notation that is suggestive of our main applications: rather than $\phi(x)$ we will write $\phi(p)$ or $\phi(q)$, having in mind probability distributions $p$ or $q$. A convex function $\phi : S \subset \mathbb{R}^m \to [-\infty, +\infty]$ is proper if there is no $q \in S$ with $\phi(q) = -\infty$ and there is some $q$ with $\phi(q) < \infty$. The effective domain of $\phi$, denoted $\Delta_\phi$, is the set of points where $\phi$ is finite: $\Delta_\phi = \{q \in S \mid \phi(q) < \infty\}$; for brevity we usually refer to $\Delta_\phi$ as simply the domain of $\phi$. A proper convex function is closed if it is lower semi-continuous. The conjugate $\phi^*$ of $\phi$ is given by

$$\phi^*(v) = \sup_{q \in S} \left( \langle q, v \rangle - \phi(q) \right) \tag{2.1}$$

A proper convex function $\phi$ is said to be essentially smooth or steep if it is differentiable on the interior of its domain $\mathrm{int}(\Delta_\phi) \neq \emptyset$, and if $\lim_{n \to \infty} |\nabla\phi(q_n)| = +\infty$ whenever $q_n$ is a sequence in $\mathrm{int}(\Delta_\phi)$ converging to a point on the boundary of $\mathrm{int}(\Delta_\phi)$. The function $\phi$ is said to be coercive in case the level set $\{q \in S \mid \phi(q) \le c\}$ is bounded for every $c \in \mathbb{R}$.

Definition 2.1. Let $\phi$ be a closed, convex and proper function defined on $S \subset \mathbb{R}^m$, such that $\phi$ is differentiable on $\mathrm{int}(\Delta_\phi) \neq \emptyset$. The Bregman distance $D_\phi : \Delta_\phi \times \mathrm{int}(\Delta_\phi) \to [0, \infty)$ is defined by

$$D_\phi(p, q) = \phi(p) - \phi(q) - \langle \nabla\phi(q), p - q \rangle \tag{2.2}$$

Bregman distance can be interpreted as a measure of the convexity of $\phi$. This is easy to visualize in the one-dimensional case: drawing a tangent line to the graph of $\phi$ at $q$, the Bregman distance $D_\phi(p, q)$ is seen as the vertical distance between this line and the point $\phi(p)$.
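
Definition (2.2) translates almost directly into code. The following is a minimal sketch, assuming NumPy; the helper name `bregman` and the two example functions (half the squared Euclidean norm and the unnormalized negative entropy) are illustrative choices, not notation from the text.

```python
import numpy as np

def bregman(phi, grad_phi, p, q):
    """D_phi(p, q) = phi(p) - phi(q) - <grad phi(q), p - q>, as in (2.2)."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return phi(p) - phi(q) - np.dot(grad_phi(q), p - q)

# phi(x) = 0.5 * ||x||^2 gives D(p, q) = 0.5 * ||p - q||^2.
half_sq = lambda x: 0.5 * np.dot(x, x)
half_sq_grad = lambda x: x

# phi(x) = sum(x log x - x) gives the unnormalized Kullback-Leibler
# divergence sum(p log(p/q) - p + q).
neg_ent = lambda x: np.sum(x * np.log(x) - x)
neg_ent_grad = lambda x: np.log(x)

p = np.array([0.2, 0.3, 0.5])
q = np.array([0.4, 0.4, 0.2])
d_euclid = bregman(half_sq, half_sq_grad, p, q)
d_kl = bregman(neg_ent, neg_ent_grad, p, q)
kl_direct = np.sum(p * np.log(p / q) - p + q)
```

In both cases the tangent-line picture applies coordinate by coordinate: the distance is zero when `p == q` and positive otherwise.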

Legendre  functions  are  a  very  well  behaved  family  of  convex  functions  that  will  make  working  with 
Bregman  distances  much  easier. 

Definition 2.2. A closed convex function $\phi$ is Legendre, or a convex function of Legendre type, in case $\mathrm{int}(\Delta_\phi)$ is convex and $\phi$ is essentially smooth and strictly convex on $\mathrm{int}(\Delta_\phi)$.

The primary properties that make working with Legendre functions convenient are summarized in the following results quoted from (Rockafellar, 1970).

Proposition 2.3. (Rockafellar, 1970; Theorem 26.5) If $\phi$ is a convex function of Legendre type then $\nabla\phi : \mathrm{int}(\Delta_\phi) \to \mathrm{int}(\Delta_{\phi^*})$ is a bijection, continuous in both directions, and $\nabla\phi^* = (\nabla\phi)^{-1}$.
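
As a quick illustration of the inverse-gradient relation, consider the binary entropy function $\phi(p) = p \log p + (1 - p) \log(1 - p)$ on $(0, 1)$, whose conjugate is $\phi^*(v) = \log(1 + e^v)$. This example is an assumption chosen for illustration; it is not discussed at this point in the text. Its gradient is the logit and the gradient of its conjugate is the sigmoid, and the sketch below checks numerically that the two maps invert each other.

```python
import numpy as np

def grad_phi(p):
    """Gradient of phi(p) = p log p + (1 - p) log(1 - p): the logit."""
    return np.log(p) - np.log1p(-p)

def grad_phi_star(v):
    """Gradient of the conjugate phi*(v) = log(1 + exp(v)): the sigmoid."""
    return 1.0 / (1.0 + np.exp(-v))

p = np.linspace(0.05, 0.95, 19)   # points in int(Delta_phi) = (0, 1)
v = np.linspace(-4.0, 4.0, 17)    # points in int(Delta_phi*) = R

roundtrip_p = grad_phi_star(grad_phi(p))   # should recover p
roundtrip_v = grad_phi(grad_phi_star(v))   # should recover v
```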

Proposition 2.4. Suppose that $\phi$ is Legendre, and $\psi$ is a proper closed, convex function that is also essentially smooth. Then $\phi + \psi$ is Legendre.


In particular, since for fixed $q \in \mathrm{int}(\Delta_\phi)$ the mapping $p \mapsto \langle \nabla\phi(q), p - q \rangle + \phi(q)$ is affine linear, the function $p \mapsto D_\phi(p, q)$ is Legendre with domain $\Delta_\phi$, and with conjugate domain $\Delta_{\phi^*} - \nabla\phi(q)$.

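Definition 2.5. For $\phi$ a convex function of Legendre type we define the Legendre-Bregman conjugate $\ell_\phi : \mathrm{int}(\Delta_\phi) \times \mathbb{R}^m \to \mathbb{R} \cup \{\infty\}$ as

$$\ell_\phi(q, v) = \sup_{p \in \Delta_\phi} \left( \langle v, p \rangle - D_\phi(p, q) \right) \tag{2.3}$$

We define the Legendre-Bregman projection $\mathcal{L}_\phi : \mathrm{int}(\Delta_\phi) \times \mathbb{R}^m \to \Delta_\phi$ as

$$\mathcal{L}_\phi(q, v) = \arg\max_{p \in \Delta_\phi} \left( \langle v, p \rangle - D_\phi(p, q) \right) \tag{2.4}$$

whenever this is well defined.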


Let us explain our choice of terminology in the above definition, which is nonstandard. When $h$ is a convex function of Legendre type, the Legendre conjugate, as defined in (Rockafellar, 1970; Chapter 26), corresponds to the convex conjugate $h^*$. For fixed $q$, our definition of the Legendre-Bregman conjugate is simply the classical Legendre conjugate for the convex function $h(p) = D_\phi(p, q)$. Note that Rockafellar defines the Legendre transform as the mapping from the original convex function (and domain) to its Legendre conjugate (and associated domain). The Legendre-Bregman projection $\mathcal{L}_\phi(q, v)$ is the actual argument at which the maximum is attained. As shown by the following result, our use of the term “projection” accords with the standard terminology of Bregman projections.
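
To make the two definitions concrete, here is a worked example for one familiar choice of $\phi$, the unnormalized negative entropy. The choice of $\phi$ is an illustrative assumption; the computation only uses the definitions of the Bregman distance, the Legendre-Bregman conjugate, and the Legendre-Bregman projection.

```latex
% Worked example (illustrative): \phi(p) = \sum_i (p_i \log p_i - p_i).
\begin{align*}
D_\phi(p, q) &= \sum_i \Big( p_i \log\frac{p_i}{q_i} - p_i + q_i \Big), \\
0 &= \frac{\partial}{\partial p_i}\big( \langle v, p\rangle - D_\phi(p, q) \big)
   = v_i - \log\frac{p_i}{q_i}
   \;\Longrightarrow\; \mathcal{L}_\phi(q, v)_i = q_i e^{v_i}, \\
\ell_\phi(q, v) &= \langle v, \mathcal{L}_\phi(q, v)\rangle
   - D_\phi(\mathcal{L}_\phi(q, v), q)
   = \sum_i q_i \big( e^{v_i} - 1 \big).
\end{align*}
```

For this $\phi$ the projection is an exponential tilt of $q$, which is why the projection corresponds to an exponential-family model.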

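Proposition 2.6. Let $\phi$ be Legendre. Then for $q \in \mathrm{int}(\Delta_\phi)$ and $v \in \mathrm{int}(\Delta_{\phi^*}) - \nabla\phi(q)$, the Legendre-Bregman projection is given explicitly by

$$\mathcal{L}_\phi(q, v) = (\nabla\phi^*)(\nabla\phi(q) + v) \tag{2.5}$$

Moreover, it can be written as a Bregman projection

$$\mathcal{L}_\phi(q, v) = \arg\min_{p \in \Delta_\phi \cap H} D_\phi(p, q) \tag{2.6}$$

for the hyperplane $H = \{p \in \mathbb{R}^m \mid \langle p, v \rangle = b\}$ with $b = \langle \mathcal{L}_\phi(q, v), v \rangle$.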
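
Both statements of the proposition can be checked numerically. The sketch below assumes NumPy and the unnormalized negative entropy $\phi(p) = \sum_i (p_i \log p_i - p_i)$, for which $\nabla\phi = \log$ and $\nabla\phi^* = \exp$; the function names and the random perturbation test are illustrative, not part of the paper.

```python
import numpy as np

def kl(p, q):
    """Unnormalized KL divergence: the Bregman distance for this phi."""
    return float(np.sum(p * np.log(p / q) - p + q))

def lb_projection(q, v):
    """(grad phi*)(grad phi(q) + v) = exp(log q + v) = q * exp(v)."""
    return q * np.exp(v)

def objective(p, q, v):
    """The function maximized in the definition of the projection."""
    return float(np.dot(v, p)) - kl(p, q)

rng = np.random.default_rng(0)
q = np.array([0.4, 0.4, 0.2])
v = np.array([0.3, -0.5, 1.0])
p_star = lb_projection(q, v)

# Stationarity: grad phi(p*) = grad phi(q) + v.
stationarity_gap = np.max(np.abs(np.log(p_star) - (np.log(q) + v)))

# p* should not be beaten by nearby positive points in the maximization.
best = objective(p_star, q, v)
trials = [objective(p_star * np.exp(0.1 * rng.standard_normal(3)), q, v)
          for _ in range(200)]

# On the hyperplane <p, v> = <p*, v>, p* should minimize D(p, q).
def on_hyperplane(delta):
    delta = delta - np.dot(delta, v) / np.dot(v, v) * v  # drop v-component
    return p_star + 0.02 * delta

hyper_trials = [kl(on_hyperplane(rng.standard_normal(3)), q)
                for _ in range(200)]
```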
Proof. To prove the first statement, note that a stationary point $p^*$ of $\langle p, v \rangle - D_\phi(p, q)$ must satisfy:

$$\nabla\phi(p^*) = \nabla\phi(q) + v \tag{2.7}$$

Since $\phi$ is Legendre, for $\nabla\phi(q) + v \in \mathrm{int}(\Delta_{\phi^*})$, we then have that

\begin{align}
p^* &= (\nabla\phi)^{-1}(\nabla\phi(q) + v) \tag{2.8} \\
    &= (\nabla\phi^*)(\nabla\phi(q) + v) \tag{2.9}
\end{align}

using the fact that $(\nabla\phi)^{-1}$ is well defined and equal to $\nabla\phi^*$ from Proposition 2.3. The second statement follows from, for example, the results in Section 2.2 of Censor and Zenios (1997) on projections onto hyperplanes, noting that every Legendre function is zone consistent. □

In our formulation of the duality theorem, it is the Legendre-Bregman projection $\mathcal{L}_\phi(q, v)$ rather than the conjugate function $\ell_\phi(q, v)$ that plays a central role, leading to a natural and simple statement of convex duality for Bregman distances. This projection corresponds to a statistical model, which is the primary object of interest for machine learning problems.

B.  Basic  Relations 

We now derive some basic algebraic relations between $D_\phi$, $\ell_\phi$ and $\mathcal{L}_\phi$. These relations will be important in establishing the geometrical aspects of the duality theorem, as well as for deriving auxiliary functions. In order to free us from having to specify the domain of $\ell_\phi$ and $\mathcal{L}_\phi$, we will in the following assume that $\Delta_{\phi^*} = \mathbb{R}^m$.

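Proposition 2.7. Let $\phi$ be Legendre, with $\Delta_{\phi^*} = \mathbb{R}^m$. For fixed $p \in \Delta_\phi$, $D_\phi(p, \mathcal{L}_\phi(q, v))$ is continuous in $q \in \mathrm{int}(\Delta_\phi)$ and convex in $v$. Together, the Legendre-Bregman conjugate and projection satisfy

\begin{align}
D_\phi(p, q) - D_\phi(p, \mathcal{L}_\phi(q, v)) &= \langle v, p \rangle - \ell_\phi(q, v) \tag{2.10} \\
&= D_\phi(\mathcal{L}_\phi(q, v), q) + \langle v, p - \mathcal{L}_\phi(q, v) \rangle \tag{2.11}
\end{align}

for all $p \in \Delta_\phi$, $q \in \mathrm{int}(\Delta_\phi)$ and $v \in \mathbb{R}^m$.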
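
The two identities are easy to sanity-check numerically. The sketch below assumes NumPy and the unnormalized negative entropy $\phi(p) = \sum_i (p_i \log p_i - p_i)$, for which the projection is $q_i e^{v_i}$ and the conjugate is $\sum_i q_i (e^{v_i} - 1)$; the function names are hypothetical.

```python
import numpy as np

def kl(p, q):
    """Unnormalized KL divergence: the Bregman distance for this phi."""
    return float(np.sum(p * np.log(p / q) - p + q))

def lb_projection(q, v):
    """Legendre-Bregman projection for this phi: q * exp(v)."""
    return q * np.exp(v)

def lb_conjugate(q, v):
    """Legendre-Bregman conjugate for this phi: sum q (exp(v) - 1)."""
    return float(np.sum(q * np.expm1(v)))

p = np.array([0.2, 0.3, 0.5])
q = np.array([0.4, 0.4, 0.2])
v = np.array([0.3, -0.5, 1.0])
proj = lb_projection(q, v)

lhs = kl(p, q) - kl(p, proj)                        # left side of (2.10)
mid = float(np.dot(v, p)) - lb_conjugate(q, v)      # right side of (2.10)
rhs = kl(proj, q) + float(np.dot(v, p - proj))      # right side of (2.11)
```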


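Proof. Using the definition of Bregman distance and Proposition 2.6, we have for $q \in \mathrm{int}(\Delta_\phi)$ that

\begin{align}
D_\phi(p, q) &- D_\phi(p, \mathcal{L}_\phi(q, v)) \nonumber \\
&= \phi(\mathcal{L}_\phi(q, v)) - \phi(q) + \langle \nabla\phi(\mathcal{L}_\phi(q, v)), p - \mathcal{L}_\phi(q, v) \rangle - \langle \nabla\phi(q), p - q \rangle \tag{2.12} \\
&= \phi(\mathcal{L}_\phi(q, v)) - \phi(q) + \langle \nabla\phi(q) + v, p - \mathcal{L}_\phi(q, v) \rangle - \langle \nabla\phi(q), p - q \rangle \tag{2.13} \\
&= \langle v, p \rangle - \langle v, \mathcal{L}_\phi(q, v) \rangle + \phi(\mathcal{L}_\phi(q, v)) - \phi(q) - \langle \nabla\phi(q), \mathcal{L}_\phi(q, v) - q \rangle \tag{2.14} \\
&= \langle v, p \rangle - \langle v, \mathcal{L}_\phi(q, v) \rangle + D_\phi(\mathcal{L}_\phi(q, v), q) \tag{2.15} \\
&= \langle v, p \rangle - \ell_\phi(q, v) \tag{2.16}
\end{align}

Therefore (2.10) holds for $p \in \Delta_\phi$ and $q \in \mathrm{int}(\Delta_\phi)$. From the definition of the Legendre-Bregman conjugate we have for $q \in \mathrm{int}(\Delta_\phi)$, that

$$\ell_\phi(q, v) = \langle v, \mathcal{L}_\phi(q, v) \rangle - D_\phi(\mathcal{L}_\phi(q, v), q) \tag{2.17}$$

Equation (2.11) results from combining this with (2.10). The convexity of $D_\phi(p, \mathcal{L}_\phi(q, v))$ in $v$ follows from (2.10), which expresses $D_\phi(p, \mathcal{L}_\phi(q, v))$ as a sum of the convex functions $\ell_\phi(q, v)$ and $D_\phi(p, q) - \langle v, p \rangle$. □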

The next result shows how $D_\phi(p, \mathcal{L}_\phi(q, v))$ varies with $v$, and will be an important ingredient in the duality result of the next section.

Proposition 2.8. Let $\phi$ be Legendre, with $p \in \Delta_\phi$ and $q \in \mathrm{int}(\Delta_\phi)$. Then for $v \in \mathbb{R}^m$, the mapping $t \mapsto D_\phi(p, \mathcal{L}_\phi(q, tv))$ is differentiable at $t = 0$, with derivative

$$\frac{d}{dt}\Big|_{t=0} D_\phi(p, \mathcal{L}_\phi(q, tv)) = \langle v, q \rangle - \langle v, p \rangle . \tag{2.18}$$
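
A finite-difference check gives a quick sanity test of the derivative formula. The sketch assumes NumPy and the unnormalized negative entropy, for which the Legendre-Bregman projection is $q_i e^{t v_i}$; the step size `h` and the function name `g` are arbitrary illustrative choices.

```python
import numpy as np

def kl(p, q):
    """Unnormalized KL divergence: the Bregman distance for this phi."""
    return float(np.sum(p * np.log(p / q) - p + q))

p = np.array([0.2, 0.3, 0.5])
q = np.array([0.4, 0.4, 0.2])
v = np.array([0.3, -0.5, 1.0])

def g(t):
    """t -> D(p, L(q, t v)) with L(q, t v) = q * exp(t v)."""
    return kl(p, q * np.exp(t * v))

h = 1e-6
numeric = (g(h) - g(-h)) / (2 * h)                  # central difference at t = 0
exact = float(np.dot(v, q) - np.dot(v, p))          # <v, q> - <v, p>
```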


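Proof. Since $\phi$ is Legendre, if $q \in \mathrm{int}(\Delta_\phi)$ then $\mathcal{L}_\phi(q, tv) \in \mathrm{int}(\Delta_\phi)$; see for example Theorem 3.12 of (Bauschke & Borwein, 1997). From Proposition 2.7 we have that

\begin{align}
\frac{d}{dt}\Big|_{t=0} D_\phi(p, \mathcal{L}_\phi(q, tv))
&= \frac{d}{dt}\Big|_{t=0} \Big( \langle tv, \mathcal{L}_\phi(q, tv) \rangle - \langle tv, p \rangle - D_\phi(\mathcal{L}_\phi(q, tv), q) \Big) \tag{2.19} \\
&= \langle v, q \rangle - \langle v, p \rangle - \frac{d}{dt}\Big|_{t=0} \Big( \phi(\mathcal{L}_\phi(q, tv)) - \langle \nabla\phi(q), \mathcal{L}_\phi(q, tv) \rangle \Big) \tag{2.20} \\
&= \langle v, q \rangle - \langle v, p \rangle \tag{2.21}
\end{align}

where the last derivative vanishes because $\mathcal{L}_\phi(q, 0) = q$. This proves the result for $p \in \Delta_\phi$ and $q \in \mathrm{int}(\Delta_\phi)$. □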


C.  The  Continuous  Extension 

The results above are given in terms of the Bregman distance using its standard definition as a function on $\Delta_\phi \times \mathrm{int}(\Delta_\phi)$. We now make assumptions that allow us to work with $D_\phi$ as an extended real-valued function on $\Delta_\phi \times \Delta_\phi$. This enables us to formulate a very natural and general duality result, presented in the following section.

Informally, we assume that $D_\phi$ extends continuously from $\Delta_\phi \times \mathrm{int}(\Delta_\phi)$ to $\Delta_\phi \times \Delta_\phi$, and that $\mathcal{L}_\phi$ extends continuously from $\mathrm{int}(\Delta_\phi) \times \mathbb{R}^m$ to $\Delta_\phi \times \mathbb{R}^m$. In addition, we require a form of compactness to guarantee the existence of certain minimizers. As before, in order to simplify the presentation we assume that the range of $\nabla\phi$ is all of $\mathbb{R}^m$.
Thus, we make the following assumptions on $\phi$:

A1. $\phi$ is of Legendre type;

A2. $\Delta_{\phi^*} = \mathbb{R}^m$;

A3. $D_\phi$ extends to a function $D_\phi : \Delta_\phi \times \Delta_\phi \to [0, \infty]$ such that $D_\phi(p, q)$ is continuous in $p$ and $q$, and satisfies $D_\phi(p, q) = 0$ if and only if $p = q$;

A4. $\mathcal{L}_\phi$ extends to a function $\mathcal{L}_\phi : \Delta_\phi \times \mathbb{R}^m \to \Delta_\phi$ satisfying $\mathcal{L}_\phi(q, 0) = q$, such that $\mathcal{L}_\phi(q, v)$ and $D_\phi(\mathcal{L}_\phi(q, v), q)$ are jointly continuous in $q$ and $v$;

A5. $D_\phi(p, \cdot)$ is coercive for every $p \in \Delta_\phi \setminus \mathrm{int}(\Delta_\phi)$.

Note that since for a Legendre function $\nabla\phi$ is continuous on $\mathrm{int}(\Delta_\phi)$ (Proposition 2.3), it follows from Definition 2.1 that $D_\phi(p, q)$ is jointly continuous on $\Delta_\phi \times \mathrm{int}(\Delta_\phi)$ and from Proposition 2.6 that $\mathcal{L}_\phi(q, v)$ is jointly continuous on $\mathrm{int}(\Delta_\phi) \times \mathbb{R}^m$. We also note that since we assume $\Delta_{\phi^*} = \mathbb{R}^m$, $D_\phi(p, \cdot)$ is automatically coercive for $p \in \mathrm{int}(\Delta_\phi)$. Together, conditions A1-A5 imply that $\phi$ is a Bregman-Legendre function as defined by Bauschke and Borwein (1997). Note that we do not assume that $D_\phi$ is jointly continuous on $\Delta_\phi \times \Delta_\phi$.

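Now, from the definition of the Legendre-Bregman conjugate we have

$$\ell_\phi(q, v) = \langle v, \mathcal{L}_\phi(q, v) \rangle - D_\phi(\mathcal{L}_\phi(q, v), q) \tag{2.22}$$

for $q \in \mathrm{int}(\Delta_\phi)$. Properties A4 and A5 allow us to define $\ell_\phi : \Delta_\phi \times \mathbb{R}^m \to \mathbb{R}$ as the continuous extension of $\ell_\phi : \mathrm{int}(\Delta_\phi) \times \mathbb{R}^m \to \mathbb{R}$, satisfying the same identity. Thus, the Legendre-Bregman conjugate $\ell_\phi(q, v)$ is continuous in $q$, continuous and convex in $v$, and satisfies $\ell_\phi(q, 0) = 0$.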

Proposition  2.7  now  generalizes  to  the  continuous  extension. 

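Proposition 2.9. Let $\phi$ satisfy A1-A4. For fixed $p \in \Delta_\phi$, $D_\phi(p, \mathcal{L}_\phi(q, v))$ is continuous in $q$ and convex in $v$. Together, the Legendre-Bregman conjugate and projection satisfy

\begin{align}
D_\phi(p, q) - D_\phi(p, \mathcal{L}_\phi(q, v)) &= \langle v, p \rangle - \ell_\phi(q, v) \tag{2.23} \\
&= D_\phi(\mathcal{L}_\phi(q, v), q) + \langle v, p - \mathcal{L}_\phi(q, v) \rangle \tag{2.24}
\end{align}

for all $p, q \in \Delta_\phi$ and $v \in \mathbb{R}^m$.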


Proof. This follows directly from the continuity of $\mathcal{L}_\phi(q, v)$, $D_\phi(\mathcal{L}_\phi(q, v), q)$, and $\ell_\phi(q, v)$. □

The differential identity in Proposition 2.8 also extends.

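Proposition 2.10. Let $\phi$ satisfy A1-A4, and let $p, q \in \Delta_\phi$, with $D_\phi(p, q) < \infty$. Then for $v \in \mathbb{R}^m$, the mapping $t \mapsto D_\phi(p, \mathcal{L}_\phi(q, tv))$ is differentiable at $t = 0$, with derivative

$$\frac{d}{dt}\Big|_{t=0} D_\phi(p, \mathcal{L}_\phi(q, tv)) = \langle v, q \rangle - \langle v, p \rangle . \tag{2.25}$$

Proof. Let $q \in \mathrm{int}(\Delta_\phi)$. From Proposition 2.8 we know that

$$\frac{d}{dt}\Big|_{t=0} D_\phi(p, \mathcal{L}_\phi(q, tv)) = \langle v, q \rangle - \langle v, p \rangle \tag{2.26}$$

Thus, since $\mathcal{L}_\phi(q, (t + s)v) = \mathcal{L}_\phi(\mathcal{L}_\phi(q, tv), sv)$, we also have that

$$\frac{d}{dt} D_\phi(p, \mathcal{L}_\phi(q, tv)) = \langle v, \mathcal{L}_\phi(q, tv) \rangle - \langle v, p \rangle \tag{2.27}$$

To show that the result holds when $q \in \Delta_\phi \setminus \mathrm{int}(\Delta_\phi)$, we’ll use a fact from elementary analysis: if $f_n \to f$, and $f_n'$ is continuous with $f_n'(t) \to g(t)$ uniformly for $t \in [a, b]$, then $g$ is continuous and $f'(t) = g(t)$. First, let $q \in \mathrm{int}(\Delta_\phi)$ and $p \notin \mathrm{int}(\Delta_\phi)$. Suppose $p_n \in \mathrm{int}(\Delta_\phi)$ with $p_n \to p$. Since $\mathcal{L}_\phi(q, tv) \in \mathrm{int}(\Delta_\phi)$, we have from the above calculation that

$$\frac{d}{dt} D_\phi(p_n, \mathcal{L}_\phi(q, tv)) = \langle v, \mathcal{L}_\phi(q, tv) \rangle - \langle v, p_n \rangle \longrightarrow \langle v, \mathcal{L}_\phi(q, tv) \rangle - \langle v, p \rangle \tag{2.28}$$

where the convergence is uniform on every interval $[a, b]$ around zero; property (2.25) follows.

Now suppose that $p \in \Delta_\phi$, $q \in \Delta_\phi \setminus \mathrm{int}(\Delta_\phi)$, and $q_n \in \mathrm{int}(\Delta_\phi)$ with $q_n \to q$. Because $\mathcal{L}_\phi(q, v)$ is jointly continuous in $q$ and $v$, it is uniformly continuous on every compact set of $(q, v)$. In particular, $\langle v, \mathcal{L}_\phi(q_n, tv) \rangle - \langle v, p \rangle$ converges uniformly in $t$ to $\langle v, \mathcal{L}_\phi(q, tv) \rangle - \langle v, p \rangle$ on every interval $[a, b]$. Thus property (2.25) holds for all $p, q \in \Delta_\phi$. □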

Propositions 2.9 and 2.10 are the main computations that we will require in the following section.


III.  Duality 

In this section we derive the main duality result. The setup is that we have a set of features $f^{(j)} \in \mathbb{R}^m$, $j = 1, 2, \ldots, n$, and denote by $F$ the $m \times n$ matrix with columns given by the $f^{(j)}$. These features correspond to the “weak learners” in boosting, or to the sufficient statistics in an exponential model. The primal problem constrains the values $\langle p, f^{(j)} \rangle$, and these constraints carry over to Lagrange multipliers in a family of Legendre-Bregman projections $\mathcal{L}_\phi(q, F\lambda)$ in the dual problem.

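Definition 3.1. For a given element $p_0 \in \Delta_\phi$, the feasible set for $p_0$ and $F$ is defined by

$$\mathcal{P}(p_0, F) = \left\{ p \in \Delta_\phi \mid \langle p, f^{(j)} \rangle = \langle p_0, f^{(j)} \rangle, \ j = 1, \ldots, n \right\} \tag{3.1}$$

For a given $q_0 \in \Delta_\phi$, the Legendre-Bregman projection family for $q_0$ and $F$ is defined by

$$\mathcal{Q}(q_0, F) = \left\{ q \in \Delta_\phi \mid q = \mathcal{L}_\phi(q_0, F\lambda) \text{ for some } \lambda \in \mathbb{R}^n \right\} \tag{3.2}$$

Trivially, both sets are non-empty since $p_0 \in \mathcal{P}(p_0, F)$ and $q_0 \in \mathcal{Q}(q_0, F)$. Since $p_0$, $q_0$, and $F$ will be fixed, we will use abbreviated notation and refer to these sets as $\mathcal{P}$ and $\mathcal{Q}$. We use $\bar{\mathcal{Q}}$ to denote the closure of $\mathcal{Q}(q_0, F)$ as a subset of $\mathbb{R}^m$. Duality relates the projection onto $\mathcal{P}$ to the projection onto $\bar{\mathcal{Q}}$.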

Proposition 3.2. Let $\phi$ satisfy A1-A5, and suppose that $p_0, q_0 \in \Delta_\phi$ with $D_\phi(p_0, q_0) < \infty$. Then there exists a unique $q^* \in \Delta_\phi$ satisfying the following four properties:

(1) $q^* \in \mathcal{P} \cap \bar{\mathcal{Q}}$

(2) $D_\phi(p, q) = D_\phi(p, q^*) + D_\phi(q^*, q)$ for any $p \in \mathcal{P}$ and $q \in \bar{\mathcal{Q}}$

(3) $q^* = \arg\min_{p \in \mathcal{P}} D_\phi(p, q_0)$

(4) $q^* = \arg\min_{q \in \bar{\mathcal{Q}}} D_\phi(p_0, q)$

Moreover, any one of these four properties determines $q^*$ uniquely.
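
The four properties can be observed numerically on a small problem. The sketch below assumes NumPy and the unnormalized negative entropy, so that the Bregman distance is the unnormalized KL divergence on positive vectors and the projection family is $q_0 \exp(F\lambda)$. The matrix `F`, the vectors `p0` and `q0`, and the use of an undamped Newton iteration on the dual objective are all illustrative assumptions, adequate only for a well-conditioned toy example.

```python
import numpy as np

def kl(p, q):
    """Unnormalized KL divergence between positive vectors."""
    return float(np.sum(p * np.log(p / q) - p + q))

F = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5], [-1.0, 0.5]])
p0 = np.array([0.10, 0.20, 0.30, 0.25, 0.15])
q0 = np.full(5, 0.2)

# Dual problem: minimize D(p0, q0 * exp(F lam)) over lam.
# Gradient: F^T (q - p0); Hessian: F^T diag(q) F.
lam = np.zeros(2)
for _ in range(50):
    q = q0 * np.exp(F @ lam)
    grad = F.T @ (q - p0)
    hess = F.T @ (q[:, None] * F)
    lam -= np.linalg.solve(hess, grad)
q_star = q0 * np.exp(F @ lam)

# Another member of P: move p0 along a null-space direction of F^T.
_, _, vh = np.linalg.svd(F.T)
p = p0 + 0.05 * vh[2]

# Another member of Q: an arbitrary choice of multipliers.
q_other = q0 * np.exp(F @ np.array([0.7, -0.4]))

pythagorean_gap = kl(p, q_other) - kl(p, q_star) - kl(q_star, q_other)
```

Here `q_star` lies in the projection family by construction, and the vanishing dual gradient says exactly that it also satisfies the linear constraints.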

In order to prove Proposition 3.2, we first prove two lemmas. The first shows that there is at least one member in common between $\mathcal{P}$ and $\bar{\mathcal{Q}}$; the second shows that the Pythagorean equality (2) holds for any such member.

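Lemma 3.3. If $D_\phi(p_0, q_0) < \infty$ then $\mathcal{P} \cap \bar{\mathcal{Q}}$ is nonempty.

Proof. Note that since $D_\phi(p_0, q_0) < \infty$, $D_\phi(p_0, q)$ is not identically $\infty$ on $\bar{\mathcal{Q}}$. Also, the mapping $\lambda \mapsto D_\phi(p_0, \mathcal{L}_\phi(q_0, F\lambda))$ is continuous and convex. Let $\mathcal{R}$ be the level set

$$\mathcal{R} = \left\{ q \in \Delta_\phi \mid D_\phi(p_0, q) \le D_\phi(p_0, q_0) \right\} \tag{3.3}$$

We know from Assumption A5 that $\mathcal{R}$ is bounded. Thus $D_\phi(p_0, \cdot)$ attains its minimum at a (not necessarily unique) point $q^* \in \bar{\mathcal{Q}} \cap \mathcal{R} \subset \bar{\mathcal{Q}}$. We will show that $q^*$ is also in $\mathcal{P}$.

Let $q \in \bar{\mathcal{Q}}$, and let $\mu_j \in \mathbb{R}^n$ be such that $q = \lim_{j \to \infty} \mathcal{L}_\phi(q_0, F\mu_j)$. Then by the continuity of $\mathcal{L}_\phi(\cdot, \cdot)$,

\begin{align}
\mathcal{L}_\phi(q, F\lambda) &= \lim_{j \to \infty} \mathcal{L}_\phi(\mathcal{L}_\phi(q_0, F\mu_j), F\lambda) \tag{3.4} \\
&= \lim_{j \to \infty} \mathcal{L}_\phi(q_0, F(\mu_j + \lambda)) \in \bar{\mathcal{Q}} \tag{3.5}
\end{align}

Thus $\bar{\mathcal{Q}}$ is closed under the mapping $q \mapsto \mathcal{L}_\phi(q, F\lambda)$ for $\lambda \in \mathbb{R}^n$, and $\mathcal{L}_\phi(q^*, F\lambda)$ is in $\bar{\mathcal{Q}}$ for any $\lambda$. By the definition of $q^*$, it follows that $\lambda = 0$ is a minimum of the function $\lambda \mapsto D_\phi(p_0, \mathcal{L}_\phi(q^*, F\lambda))$. Taking derivatives with respect to $\lambda$ and using Proposition 2.10 we conclude that $\langle q^*, f^{(j)} \rangle = \langle p_0, f^{(j)} \rangle$ for each $j$; thus $q^* \in \mathcal{P}$. □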

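Lemma 3.4. If $q^* \in \mathcal{P} \cap \bar{\mathcal{Q}}$ then the Pythagorean equality $D_\phi(p, q) = D_\phi(p, q^*) + D_\phi(q^*, q)$ holds for any $p \in \mathcal{P}$ and $q \in \bar{\mathcal{Q}}$.

Proof. Suppose that $p_1, p_2, q_1, q_2 \in \Delta_\phi$ with $q_2 = \mathcal{L}_\phi(q_1, F\lambda)$. From Proposition 2.9 we have that

$$D_\phi(p_1, q_1) - D_\phi(p_1, q_2) = \langle p_1, F\lambda \rangle - \ell_\phi(q_1, F\lambda) \tag{3.6}$$

and similarly

$$D_\phi(p_2, q_1) - D_\phi(p_2, q_2) = \langle p_2, F\lambda \rangle - \ell_\phi(q_1, F\lambda) \tag{3.7}$$

Therefore,

\begin{align}
D_\phi(p_1, q_1) - D_\phi(p_1, q_2) - D_\phi(p_2, q_1) + D_\phi(p_2, q_2) &= \langle p_1, F\lambda \rangle - \langle p_2, F\lambda \rangle \tag{3.8} \\
&= \sum_{j=1}^n \lambda_j \left( \langle p_1, f^{(j)} \rangle - \langle p_2, f^{(j)} \rangle \right) \nonumber
\end{align}

It follows from this identity and the continuity of $D_\phi$ that

$$D_\phi(p_1, q_1) - D_\phi(p_1, q_2) - D_\phi(p_2, q_1) + D_\phi(p_2, q_2) = 0 \tag{3.9}$$

if $p_1, p_2 \in \mathcal{P}$ and $q_1, q_2 \in \bar{\mathcal{Q}}$. The lemma follows by taking $p_1 = q_1 = q^*$. □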
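
The four-point identity (3.8) holds for arbitrary points, not only feasible ones, which makes it easy to test in isolation. The following sketch assumes NumPy and the unnormalized negative entropy, so that the Legendre-Bregman projection of `q1` is `q1 * exp(F @ lam)`; all numerical values are arbitrary illustrative choices.

```python
import numpy as np

def kl(p, q):
    """Unnormalized KL divergence: the Bregman distance of sum(x log x - x)."""
    return float(np.sum(p * np.log(p / q) - p + q))

F = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]])
lam = np.array([0.4, -0.7])
p1 = np.array([0.1, 0.2, 0.3, 0.4])
p2 = np.array([0.3, 0.3, 0.2, 0.2])
q1 = np.array([0.25, 0.25, 0.25, 0.25])
q2 = q1 * np.exp(F @ lam)   # the projection of q1 along F lam for this phi

four_point = kl(p1, q1) - kl(p1, q2) - kl(p2, q1) + kl(p2, q2)
linear_term = float(np.dot(p1 - p2, F @ lam))
```

When `p1` and `p2` satisfy the same linear constraints, `linear_term` is zero and the identity reduces to (3.9).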

Proof of Proposition 3.2. Choose $q^*$ to be any point in $\mathcal{P} \cap \bar{\mathcal{Q}}$. Such a $q^*$ exists by Lemma 3.3. It satisfies property (1) by definition, and it satisfies property (2) by Lemma 3.4. As a consequence of property (2), it also satisfies properties (3) and (4). To check property (4), note that if $q$ is any point in $\bar{\mathcal{Q}}$, then $D_\phi(p_0, q) = D_\phi(p_0, q^*) + D_\phi(q^*, q) \ge D_\phi(p_0, q^*)$. Similarly, property (3) must hold since if $p$ is any point in $\mathcal{P}$, then $D_\phi(p, q_0) = D_\phi(p, q^*) + D_\phi(q^*, q_0) \ge D_\phi(q^*, q_0)$.

It remains to prove that each of the four properties (1)-(4) determines $q^*$ uniquely. In other words, we need to show that if $m$ is a point in $\Delta_\phi$ satisfying any of the four properties (1)-(4), then $m = q^*$. Suppose that $m$ satisfies property (1). Then property (2) with $p = q = m$ implies that $D_\phi(m, m) = D_\phi(m, q^*) + D_\phi(q^*, m)$. Since $D_\phi(m, m) = 0$, it follows that $D_\phi(m, q^*) = 0$ so $m = q^*$.

If $m$ satisfies property (2), then the same argument with $q^*$ and $m$ reversed proves that $m = q^*$. Suppose that $m$ satisfies property (4). Then

$$D_\phi(p_0, q^*) \ge D_\phi(p_0, m) = D_\phi(p_0, q^*) + D_\phi(q^*, m) \tag{3.10}$$

where the equality follows from property (2) for $q^*$. Thus $D_\phi(q^*, m) \le 0$ so $m = q^*$. If $m$ satisfies property (3), then

$$D_\phi(q^*, q_0) \ge D_\phi(m, q_0) = D_\phi(m, q^*) + D_\phi(q^*, q_0) \tag{3.11}$$

showing once again that $m = q^*$. □

In the following section we outline the auxiliary function method for building iterative algorithms to compute $q^*$, and show how to use convexity to derive an auxiliary function for separable Bregman distances.

IV.  Auxiliary  Functions 

The auxiliary function approach is conceptually simple: bound the change in Bregman distance from below using a function that is easy to compute and that decouples the constraints. Maximizing this auxiliary function we obtain new parameters $\lambda' = \lambda + \Delta\lambda$ and a new model $q_{\lambda + \Delta\lambda}$ given by

\begin{align}
q_{\lambda + \Delta\lambda} &= \mathcal{L}_\phi(q_\lambda, F\Delta\lambda) \tag{4.1} \\
&= \mathcal{L}_\phi(\mathcal{L}_\phi(q_0, F\lambda), F\Delta\lambda) \tag{4.2} \\
&= \mathcal{L}_\phi(q_0, F(\lambda + \Delta\lambda)) \tag{4.3}
\end{align}

We then use the duality theorem to show that when $\Delta\lambda = 0$, we must have that $q_\lambda = q^*$.
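
The composition property behind (4.1)-(4.3) can be seen directly from the formula $\mathcal{L}_\phi(q, v) = (\nabla\phi^*)(\nabla\phi(q) + v)$. As a minimal sketch, assuming NumPy and taking $\phi$ to be a sum of binary entropies (an illustrative choice, for which the gradient is the coordinatewise logit and its inverse is the sigmoid), two successive projection steps agree with a single step using the summed parameters.

```python
import numpy as np

def logit(q):
    """Coordinatewise gradient of sum(q log q + (1 - q) log(1 - q))."""
    return np.log(q) - np.log1p(-q)

def sigmoid(x):
    """Inverse of the logit, i.e. the gradient of the conjugate."""
    return 1.0 / (1.0 + np.exp(-x))

def lb_projection(q, v):
    """Legendre-Bregman projection for this phi: sigmoid(logit(q) + v)."""
    return sigmoid(logit(q) + v)

F = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, -1.0], [0.5, 0.5]])
q0 = np.array([0.5, 0.3, 0.7, 0.2])
lam = np.array([0.3, -0.2])
dlam = np.array([-0.1, 0.5])

q_lam = lb_projection(q0, F @ lam)
two_step = lb_projection(q_lam, F @ dlam)           # project, then project again
one_step = lb_projection(q0, F @ (lam + dlam))      # project once with the sum
```

This is what allows an iterative algorithm to track only the current model: each update moves within the same projection family.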

The strategy is very similar to EM. In an EM algorithm, the Q-function $Q(\lambda', \lambda)$ is computed as a lower bound to the change in log-likelihood:

\begin{align}
\sum_x \tilde{p}(x) \log \frac{q(x \mid \lambda')}{q(x \mid \lambda)}
&= \sum_x \tilde{p}(x) \log \frac{\sum_h q(x, h \mid \lambda')}{q(x \mid \lambda)} \tag{4.4} \\
&= \sum_x \tilde{p}(x) \log \sum_h q(h \mid x, \lambda) \frac{q(x, h \mid \lambda')}{q(x, h \mid \lambda)} \tag{4.5} \\
&\ge \sum_x \tilde{p}(x) \sum_h q(h \mid x, \lambda) \log \frac{q(x, h \mid \lambda')}{q(x, h \mid \lambda)} \tag{4.6} \\
&\stackrel{\mathrm{def}}{=} Q(\lambda', \lambda) \tag{4.7}
\end{align}

where $(x, h)$ is the complete data, $x$ is the incomplete data, and the inequality follows from the concavity of the logarithm. After computing $Q$ in the E-step, it is then maximized over $\lambda'$ in the M-step.

In the same way, for Bregman distances the aim is to derive an auxiliary function $\mathcal{A}(\lambda', \lambda)$ whose calculation can be carried out efficiently in something like an E-step, and such that it can be easily maximized over $\lambda'$ in an M-step. However, just as for EM, this is a general strategy more than it is a precise algorithm. A particular Bregman distance problem may require some ingenuity in order to come up with an appropriate auxiliary function.

This is the general motivation behind the following two definitions.

Definition 4.1. A function $\mathcal{A} : \Delta_\phi \times \mathbb{R}^n \to \mathbb{R}$ is called an auxiliary function for $p_0$ and $F$ in case

1. $\mathcal{A}(q, \lambda)$ is continuous in $q$ and $\mathcal{A}(q, 0) = 0$

2. $D_\phi(p_0, q) - D_\phi(p_0, \mathcal{L}_\phi(q, F\lambda)) \ge \mathcal{A}(q, \lambda)$

3. If $\lambda = 0$ is a maximum of $\mathcal{A}(q, \lambda)$, then $\langle q, f^{(j)} \rangle = \langle p_0, f^{(j)} \rangle$ for $j = 1, \ldots, n$.

Definition 4.2. Let $\mathcal{A}$ be an auxiliary function and $q_0 \in \Delta_\phi$. The update sequence for $q_0$ with respect to $\mathcal{A}$ is defined by $q^{(0)} = q_0$ and

$$q^{(t+1)} = \mathcal{L}_\phi(q^{(t)}, F\lambda^{(t)}) \quad \text{where} \quad \lambda^{(t)} = \arg\max_\lambda \mathcal{A}(q^{(t)}, \lambda) \tag{4.8}$$

The reason for defining auxiliary functions in this way is the following result, which can be proved in a similar way to Proposition 5 in (Della Pietra et al., 1997) or Lemma 2 in (Collins et al., 2001). The compactness assumption will in general follow from coercivity in Assumption A5.

Proposition 4.3. Suppose that the sequence $q^{(t)}$ lies in a compact set. Then

$$\lim_{t \to \infty} q^{(t)} = \arg\min_{q \in \bar{\mathcal{Q}}} D_\phi(p_0, q) \tag{4.9}$$

As we now explain, auxiliary functions can be conveniently constructed by using the relation

$$D_\phi(p, q) - D_\phi(p, \mathcal{L}_\phi(q, v)) = \langle p, v \rangle - \ell_\phi(q, v) \tag{4.10}$$

from Proposition 2.9 and exploiting the convexity of $\ell_\phi(q, v)$. For simplicity, we will assume that $\phi$ is separable, so that $\phi(p) = \sum_i \phi_i(p_i)$ with each $\phi_i : \mathbb{R} \to \mathbb{R}$ satisfying properties A1-A5 (with $m = 1$). Auxiliary functions in the general case can be derived using similar arguments. For the separable case, clearly

\begin{align}
D_\phi(p, q) &= \sum_{i=1}^m D_{\phi_i}(p_i, q_i) \tag{4.11} \\
\ell_\phi(q, v) &= \sum_{i=1}^m \ell_{\phi_i}(q_i, v_i) \tag{4.12}
\end{align}

where $\ell_{\phi_i}(q, v) = \sup_{p \in \Delta_{\phi_i}} \left( pv - D_{\phi_i}(p, q) \right)$ is the Legendre-Bregman conjugate of $\phi_i$.


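Proposition 4.4. For each $i = 1, \ldots, m$, select $N_i$ so that $\sum_{j=1}^n |f^{(j)}_i| \le N_i$, and set $s_{ij} = \mathrm{sign}(f^{(j)}_i)$. Then

$$\mathcal{A}(q, \lambda) = \sum_{j=1}^n \lambda_j \langle p_0, f^{(j)} \rangle - \sum_{i=1}^m \sum_{j=1}^n \frac{|f^{(j)}_i|}{N_i} \, \ell_{\phi_i}(q_i, s_{ij} N_i \lambda_j) \tag{4.13}$$

is an auxiliary function for $p_0$ and $F$, and the corresponding update scheme is given by

$$q^{(t+1)} = \mathcal{L}_\phi(q^{(t)}, F\lambda^{(t)}) \tag{4.14}$$

where $\lambda^{(t)}_j$ satisfies

$$\sum_{i=1}^m f^{(j)}_i \, \mathcal{L}_{\phi_i}(q^{(t)}_i, s_{ij} N_i \lambda^{(t)}_j) = \langle p_0, f^{(j)} \rangle \tag{4.15}$$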

The idea of factoring out the signs $s_{ij}$ is taken from Collins et al. (2001). If we take $N_i = 1$ for all $i$ we obtain the algorithms presented in that paper. If we take $N_i = \sum_j |f^{(j)}_i|$ then we obtain algorithms analogous to the IIS algorithm of Della Pietra et al. (1997).
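
To see the update scheme in action, the sketch below works through one special case in which the update equation can be solved in closed form. It assumes NumPy, the unnormalized negative entropy (so that $\mathcal{L}_{\phi_i}(q_i, v) = q_i e^{v}$), nonnegative features, and a single constant $N_i = N$ equal to the largest row sum of $F$. Under these assumptions the update equation for coordinate $j$ becomes $e^{N\lambda_j} \langle q, f^{(j)} \rangle = \langle p_0, f^{(j)} \rangle$. The data are arbitrary illustrative values, and this is not claimed to be the exact algorithm of either cited paper.

```python
import numpy as np

def kl(p, q):
    """Unnormalized KL divergence between positive vectors."""
    return float(np.sum(p * np.log(p / q) - p + q))

# Nonnegative features: rows index coordinates i, columns index features j.
F = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5], [0.2, 0.8]])
p0 = np.array([0.10, 0.20, 0.30, 0.25, 0.15])
q = np.full(5, 0.2)

N = F.sum(axis=1).max()        # constant N_i = N with sum_j |F_ij| <= N
target = F.T @ p0              # the constrained values <p0, f_j>

distances = [kl(p0, q)]
for _ in range(500):
    lam = np.log(target / (F.T @ q)) / N   # closed-form maximizer of the bound
    q = q * np.exp(F @ lam)                # Legendre-Bregman projection step
    distances.append(kl(p0, q))

residual = np.max(np.abs(F.T @ q - target))
```

Because the auxiliary function lower-bounds the decrease in Bregman distance and vanishes at zero, each step cannot increase `kl(p0, q)`, which the recorded `distances` make visible.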

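Proof. We verify that the function defined in (4.13) satisfies the three properties of Definition 4.1. Property (1) of the definition holds since $\ell_{\phi_i}(q_i, 0) = 0$. Property (2) follows from the convexity of $\ell_{\phi_i}$. In particular, we have the inequality

\begin{align}
\ell_\phi(q, F\lambda) &= \sum_{i=1}^m \ell_{\phi_i}(q_i, (F\lambda)_i) \tag{4.16} \\
&= \sum_{i=1}^m \ell_{\phi_i}\Big(q_i, \sum_{j=1}^n f^{(j)}_i \lambda_j\Big) \tag{4.17} \\
&\le \sum_{i=1}^m \sum_{j=1}^n \frac{|f^{(j)}_i|}{N_i} \, \ell_{\phi_i}(q_i, s_{ij} N_i \lambda_j) + \sum_{i=1}^m \Big(1 - \sum_{j=1}^n \frac{|f^{(j)}_i|}{N_i}\Big) \ell_{\phi_i}(q_i, 0) \tag{4.18} \\
&= \sum_{i=1}^m \sum_{j=1}^n \frac{|f^{(j)}_i|}{N_i} \, \ell_{\phi_i}(q_i, s_{ij} N_i \lambda_j) \tag{4.19}
\end{align}

which together with Proposition 2.9 says that

$$D_\phi(p, q) - D_\phi(p, \mathcal{L}_\phi(q, F\lambda)) \ge \sum_{j=1}^n \lambda_j \langle p, f^{(j)} \rangle - \sum_{i=1}^m \sum_{j=1}^n \frac{|f^{(j)}_i|}{N_i} \, \ell_{\phi_i}(q_i, s_{ij} N_i \lambda_j) \tag{4.20}$$

Taking $p = p_0$ gives Property (2). Now, using Propositions 2.9 and 2.10, it can be shown that

$$\frac{\partial}{\partial v} \ell_{\phi_i}(q_i, v) = \mathcal{L}_{\phi_i}(q_i, v) \tag{4.21}$$

which shows that (4.15) is the correct update. Therefore, at a maximum $\lambda^*$ of $\mathcal{A}(q, \lambda)$ we have that

$$\langle p_0, f^{(j)} \rangle = \sum_{i=1}^m f^{(j)}_i \, \mathcal{L}_{\phi_i}(q_i, s_{ij} N_i \lambda^*_j) \quad \text{for each } j \tag{4.22}$$

If $\lambda^* = 0$, then

$$\langle p_0, f^{(j)} \rangle = \sum_{i=1}^m f^{(j)}_i \, \mathcal{L}_{\phi_i}(q_i, 0) = \langle q, f^{(j)} \rangle \tag{4.23}$$

showing that Property (3) holds. Thus (4.13) defines an auxiliary function. □

While the auxiliary function (4.13) looks a bit messy, both the “E-step” and “M-step” for this type of auxiliary function are generally quite practical and easy to implement.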


V.  Conclusion 

This  paper  has  presented  a  convex  duality  theorem  for  constrained  optimization  using  Bregman 
distances.  The  main  result,  Proposition  3.2,  differs  from  results  presented  in  the  convex  analysis 
literature  in  that  the  Bregman  distance  is  defined  on  the  entire  essential  domain,  rather  than  only 
on  the  interior.  This  generality  is  needed  in  many  applications.  Though  the  assumptions  A1-A5 
that  we  make  on  the  underlying  convex  function  are  fairly  restrictive,  it  may  well  be  possible  to 
relax  these  assumptions  to  cover  a  broader  class  of  examples.  In  particular,  assumption  A2,  which 
states  that  the  conjugate  domain  is  all  of  Rm,  may  not  be  essential  in  our  approach. 

The  auxiliary  function  technique  presented  in  Section  4  is  a  general  and  practical  method  for 
deriving  algorithms  for  solving  the  dual  problem.  Although  the  specific  auxiliary  function  we 
derive  assumes  the  Bregman  distance  is  separable,  similar  arguments  can  be  used  for  non-separable 
Bregman distances. The analysis given here makes clear the role of convexity, as the bounds are
derived  using  only  the  properties  of  the  underlying  Legendre-Bregman  conjugate. 